- Understand the principles and importance of reproducibility in science
- Learn the key steps in producing reproducible research
- Create and detail the structure and value of ‘tidy’ projects, data, and code
Reproducible: The same result can be independently reached given the same data & analysis pipeline.
Replicable: The same result can be independently reached given independent data & analysis pipeline.
Principle 1: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. (N.B. ‘Someone’ includes future-you.)
Principle 2: Everything you do, you will probably have to do over again.
Tidy projects
Source: phil_g
Tidy data
(N.B. See handbook for yet more…)
Messy
Source: Wickham (2009) & Pew Research Center
Tidy
Source: Wickham (2009) & Pew Research Center
Messy
Source: Wickham (2009)
Tidy
Source: Wickham (2009)
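To make the messy-versus-tidy contrast concrete, here is a minimal sketch in base R. The table, its column names, and its values are hypothetical and are not taken from the Wickham or Pew examples; `tidyr::pivot_longer()` is a common tidyverse alternative to the manual stacking shown here.

```r
# Hypothetical 'messy' table: one row per site, one column per year.
messy <- data.frame(
  site  = c("A", "B"),
  y2019 = c(10, 20),
  y2020 = c(11, 21)
)

# Stack the year columns so that each row is one observation:
# one site, one year, one count.
value_cols <- c("y2019", "y2020")
tidy <- data.frame(
  site  = rep(messy$site, times = length(value_cols)),
  year  = rep(c(2019L, 2020L), each = nrow(messy)),
  count = unlist(messy[value_cols], use.names = FALSE)
)
```

In the tidy version every variable (site, year, count) has its own column, so filtering or summarising by year no longer depends on column names.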
Microsoft Excel through the ages
.xls, .xlt, .xlm, .xlam, .xltm, .xlsx, .xltx, ...
Text through the ages
.txt
Types of text file
.csv: comma-separated values. Great all-purpose format.
.txt or .tsv: plain-text/tab-delimited.
Untidy
Tidy
Good names are
! @ # $ % ^ & * ( ) ~ + =
Good names are
_ -
separating_metadata and splitting-up-words
Good names are
data 1.csv
Good names are
2020-08-09_field-data_heights-weights.csv
Good names are
Chronological
2020-08-09_field-data_heights-weights.csv
2020-08-12_field-data_heights-weights.csv
2020-08-18_field-data_heights-weights.csv
Good names are
Logical
01_load_functions.R
02_clean_data.R
03_analysis.R
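The naming examples above can be sketched in base R as follows. `make_name()` is a hypothetical helper written for this illustration, not an existing function; the sketch assumes `_` separates metadata fields and `-` separates words within a field.

```r
# Hypothetical helper: build a file name from an ISO date plus metadata fields.
make_name <- function(date, source, content, ext = "csv") {
  paste0(format(as.Date(date), "%Y-%m-%d"), "_", source, "_", content, ".", ext)
}
file_name <- make_name("2020-08-09", "field-data", "heights-weights")

# Machine readable: the metadata can be recovered by splitting on "_".
parts <- strsplit(tools::file_path_sans_ext(file_name), "_", fixed = TRUE)[[1]]

# Chronological: with ISO dates first, alphabetical order is date order.
files <- c("2020-08-18_field-data_heights-weights.csv",
           "2020-08-09_field-data_heights-weights.csv",
           "2020-08-12_field-data_heights-weights.csv")
sorted_files <- sort(files)
```

Because the date comes first and is zero-padded, a plain `sort()` (or any file browser) lists the files in the order they were collected.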
Or special characters
##    cow_ID milk_volume weight
## 1     moo          12   1100
## 2   bumbo           2   1201
## 3    spot           ?   1084
## 4 jeffrey               1044
## 5    holy          16   1244
## 6   daisy           -   1093
Use NA if NA, or 0 if 0
##    cow_ID milk_volume weight
## 1     moo          12   1100
## 2   bumbo           2   1201
## 3    spot          NA   1084
## 4 jeffrey           0   1044
## 5    holy          16   1244
## 6   daisy           0   1093
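If the raw file already contains blanks or placeholder symbols, one option is to declare them as missing when reading, as in this rough sketch. The CSV text is a hypothetical stand-in for the cow table, and treating every blank, `?`, and `-` as `NA` is an assumption; a value known to be a true zero should be recorded as 0 instead.

```r
# Hypothetical raw file contents, mirroring the untidy cow table.
raw <- "cow_ID,milk_volume,weight
moo,12,1100
bumbo,2,1201
spot,?,1084
jeffrey,,1044
holy,16,1244
daisy,-,1093"

# Treat blanks and placeholder symbols as missing values on import.
cows <- read.csv(text = raw, na.strings = c("", "?", "-"))

# milk_volume is now numeric, with NA wherever the value is unknown.
n_missing <- sum(is.na(cows$milk_volume))
mean_milk <- mean(cows$milk_volume, na.rm = TRUE)
```

Without `na.strings`, the stray symbols would force `milk_volume` to be read as text and numeric summaries would fail.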
or a ‘data dictionary’
Hands off!
Modify by hand (only when unavoidable)
Modify via code (whenever possible)
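As a small illustration of modifying via code, the sketch below fixes a typo in a script and writes the result to a new file, leaving the raw file untouched. The file names, the data, and the typo are all hypothetical, and a temporary directory is used only so the example is self-contained.

```r
# Hypothetical raw data containing a typo in one ID.
raw_path <- file.path(tempdir(), "raw_cows.csv")
write.csv(data.frame(cow_ID = c("moo", "bumbbo"), weight = c(1100, 1201)),
          raw_path, row.names = FALSE)

# Read the raw file; it is never edited by hand.
cows <- read.csv(raw_path, stringsAsFactors = FALSE)

# The correction lives in the script, so it is documented and repeatable.
cows$cow_ID[cows$cow_ID == "bumbbo"] <- "bumbo"

# Save the cleaned data under a different name.
clean_path <- file.path(tempdir(), "clean_cows.csv")
write.csv(cows, clean_path, row.names = FALSE)
```

Anyone rerunning the script gets the same cleaned file, and the change is visible to whoever reads the code later.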
e.g. Naming conventions
MM/DD/YY
DD/MM/YY
YY/MM/DD
DD-MM-YYYY
MM-YY
Instead, split up the variables:
Or if you must, use the ISO standard: YYYY-MM-DD
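Both options can be sketched in base R. The input dates are hypothetical and assumed to be recorded as DD/MM/YYYY; the format has to be stated explicitly because the strings alone are ambiguous.

```r
# Hypothetical dates recorded as DD/MM/YYYY (an assumption for this example).
raw_dates <- c("09/08/2020", "12/08/2020", "18/08/2020")
dates <- as.Date(raw_dates, format = "%d/%m/%Y")

# Option 1: split into separate, unambiguous variables.
date_parts <- data.frame(
  year  = as.integer(format(dates, "%Y")),
  month = as.integer(format(dates, "%m")),
  day   = as.integer(format(dates, "%d"))
)

# Option 2: store the date as ISO-standard text (YYYY-MM-DD).
iso_dates <- format(dates, "%Y-%m-%d")
```

Either form removes the day/month ambiguity, and the ISO strings also sort correctly as plain text.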
Tidy code
Good:
dat_heights_2020 <- read.csv('2020_field_data_heights.csv')
Less good (maybe)
dat_field <- read.csv('2020_field_data_heights.csv')
Bad
dat <- read.csv('2020_field_data_heights.csv')
## ----------------- Load data ----------------- ##
dat_heights_2020 <- read.csv('2020_field_data_heights.csv') # Summer 2020
dat_heights_2021 <- read.csv('2021_field_data_heights.csv') # Winter 2021
## ----------------- Summarise data ----------------- ##
# Calculate mean +- SD heights
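library(dplyr)
# `height` is an assumed column name; the summary is built from the 2020 data
dat_heights_summary <- dat_heights_2020 %>%
  summarise(mean = mean(height),
            sd = sd(height),
            n = n())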
Good
height <- cm * 6 + mm
mean(x, na.rm = TRUE)
Bad
height<-cm*6+mm
mean(x,na.rm=TRUE)
Good
do_something_very_complicated(
  something = "that",
  requires = many,
  arguments = "which may be long"
)
Bad
do_something_very_complicated("that", requires, many, arguments, "which may be long")
setwd() exists
Bad
`data <- read.csv('C:/tomscomputer/projects/feeding_experiment/data/feeding_data.csv')`
Good
`data <- read.csv('data/feeding_data.csv')`
Also see here::here()
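A minimal sketch of the relative-path approach in base R follows. It assumes the working directory is the project root (for example, because the project was opened as an RStudio project) and that the data sit in a `data/` subfolder as in the example above. `here::here("data", "feeding_data.csv")` builds the same kind of path from the project root without relying on the working directory.

```r
# Build the path from its parts instead of hard-coding a machine-specific location.
data_path <- file.path("data", "feeding_data.csv")

# Check before reading so that a wrong working directory gives a clear message.
if (file.exists(data_path)) {
  data <- read.csv(data_path)
} else {
  message("Not found: ", data_path,
          " (is the working directory the project root?)")
}
```

The same script then runs unchanged on any computer that holds a copy of the project folder.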
styler::style_file()
Before
height<-cm*6+mm+2; mean(x,na.rm=TRUE)
After
height <- cm * 6 + mm + 2
mean(x, na.rm = TRUE)
Thanks!